[feat][evaluation] experiment retry mode#434
Conversation
CozeLoop
left a comment
There was a problem hiding this comment.
Overall this PR introduces experiment-level retry modes (retry all / retry selected items) and the necessary plumbing across IDL, application, domain, scheduler, and infra layers. I focused on correctness, concurrency, error handling, and IDL integration.
Key findings:
- Found two IDL annotation typos (
api.boyinstead ofapi.body) that will preventitem_retry_numfrom being bound at the HTTP/OpenAPI layer. - Identified a small error-handling issue where a nil error is wrapped, which can be misleading when debugging.
- Suggested tightening invariants around run-log item IDs to avoid accidental duplicates in a single retry request.
The rest of the changes (locking primitives, scheduler modes, retry caps, repo/DAO extensions, and error codes) look conceptually sound and consistent with the existing evaluation pipeline. Please see inline comments for details and concrete fixes.
| 41: optional bool enable_weighted_score (api.body = 'enable_weighted_score', go.tag='json:"enable_weighted_score"') | ||
| 42: optional map<i64, double> evaluator_score_weights (api.body = 'evaluator_score_weights', go.tag='json:"evaluator_score_weights"') | ||
| 43: optional i64 expt_template_id (api.body='expt_template_id',api.js_conv='true', go.tag='json:"expt_template_id"') | ||
| 45: optional i32 item_retry_num (api.boy = 'item_retry_num') |
There was a problem hiding this comment.
[Must Fix] Thrift annotation typo prevents HTTP binding for item_retry_num.
Problem: This field uses api.boy instead of api.body. Hertz/AGW generators only recognize api.body, so item_retry_num from the request body will not be bound into CreateExperimentRequest/SubmitExperimentRequest. The feature will appear to be ignored at the API layer.
Recommendation: Change api.boy = 'item_retry_num' to api.body = 'item_retry_num' for all item_retry_num fields in this Thrift file.
Reference: See Hertz Thrift annotations in your own IDL guidelines and CloudWeGo docs for api.body usage (similar to other fields in this struct).
| 20: optional i32 item_concur_num (api.body = 'item_concur_num') | ||
| 22: optional common.RuntimeParam target_runtime_param (api.body = 'target_runtime_param') | ||
|
|
||
| 45: optional i32 item_retry_num (api.boy = 'item_retry_num') |
There was a problem hiding this comment.
[Must Fix] OpenAPI Thrift annotation typo for item_retry_num.
Problem: This field uses api.boy instead of api.body. As a result, the OpenAPI layer will not bind item_retry_num from the HTTP request body into SubmitExperimentOApiRequest, so clients cannot configure retry times via OpenAPI.
Recommendation: Replace api.boy = 'item_retry_num' with api.body = 'item_retry_num' here to align with other fields and the Hertz binding expectations.
Reference: CloudWeGo/Hertz Thrift annotations for HTTP binding (same pattern you use for item_concur_num and other body fields).
| rl, err = e.runLogRepo.Get(ctx, exptID, runID) | ||
| if err != nil { | ||
| return 0, false, err | ||
| } | ||
|
|
||
| if rl == nil { | ||
| return 0, false, errorx.Wrapf(err, "target runlog %v not found, expt_id: %v", runID, exptID) |
There was a problem hiding this comment.
[Should Fix] Wrapping a nil error here loses context and is misleading.
Problem: In this branch, err is guaranteed to be nil (you already returned on non-nil errors from runLogRepo.Get above). Calling errorx.Wrapf(err, ...) with a nil cause produces an error whose underlying cause is effectively empty and can confuse error inspection/log analysis.
Recommendation: Use a constructor that does not expect a non-nil cause, for example:
if rl == nil {
return 0, false, errorx.New("target runlog %v not found, expt_id: %v", runID, exptID)
}or an equivalent errorx.NewByCode + extra message if you want to reuse a typed code.
Reference: Go Code Review Comments – avoid wrapping nil errors; keep error chains meaningful for errors.Is/As and log inspection.
| func (e *ExptRunLog) AppendItemIDs(itemIDs []int64) error { | ||
| if e == nil { | ||
| return errorx.New("ExptRunLog AppendItemIDs must init first") | ||
| } | ||
| exists := make(map[int64]bool) | ||
| for _, chunk := range e.ItemIds { | ||
| for _, itemID := range chunk.ItemIDs { | ||
| exists[itemID] = true | ||
| } | ||
| } | ||
| rlItems := ExptRunLogItems{CreateAt: gptr.Of(time.Now().Unix())} | ||
| for _, itemID := range itemIDs { | ||
| if exists[itemID] { | ||
| return errorx.NewByCode(errno.EvalItemAlreadyRetryingCode, errorx.WithExtraMsg(fmt.Sprintf("existed item_id: %v", itemID))) | ||
| } else { | ||
| rlItems.ItemIDs = append(rlItems.ItemIDs, itemID) | ||
| } | ||
| } | ||
| e.ItemIds = append(e.ItemIds, rlItems) |
There was a problem hiding this comment.
[Suggestion] Guard against duplicate item IDs within the same retry request.
Problem: exists is built only from the already-persisted ItemIds chunks. If itemIDs passed into AppendItemIDs contains duplicates in the same call (e.g. [1, 1, 2] when ItemIds is currently empty), both entries will be appended because the exists map is never updated for the new IDs. This can lead to redundant scheduling work and makes the run log harder to reason about.
Recommendation: Track newly added IDs in the same exists map so the method enforces uniqueness across both historical and in-request items, for example:
for _, itemID := range itemIDs {
if exists[itemID] {
return errorx.NewByCode(errno.EvalItemAlreadyRetryingCode,
errorx.WithExtraMsg(fmt.Sprintf("existed item_id: %v", itemID)))
}
exists[itemID] = true
rlItems.ItemIDs = append(rlItems.ItemIDs, itemID)
}This keeps the invariant "each itemID appears at most once per run log" regardless of client input.
Reference: Go Code Review Comments – keep invariants explicit and avoid silent duplication of identifiers.
| github.com/google/renameio v0.1.0/go.mod h1:KWCgfxg9yswjAJkECMjeO8J8rahYeXnNhOm40UhjYkI= | ||
| github.com/google/s2a-go v0.1.8 h1:zZDs9gcbt9ZPLV0ndSyQk6Kacx2g/X+SKYovpnz3SMM= | ||
| github.com/google/s2a-go v0.1.8/go.mod h1:6iNWHTpQ+nfNRN5E00MSdfDwVesa8hhS32PhPO8deJA= | ||
| github.com/google/subcommands v1.2.0 h1:vWQspBTo2nEqTUFita5/KeEWlUL8kQObDFbub/EN9oE= |
There was a problem hiding this comment.
[No Risk] No issues found with this dependency change.
The added checksum for github.com/google/subcommands v1.2.0 is consistent with the existing go.mod entry and does not introduce any obvious risks for the backend. Version choice and scope of usage look reasonable for CLI-style helpers.
There was a problem hiding this comment.
[No Risk] DAO extensions for batch queries and raw run-log scanning look fine.
MGetItemResults/MGetItemTurnResults and the RawFilter branch in ScanItemRunLogs use appropriate query builders, respect the existing index (uk_expt_run_item_turn), and keep pagination semantics intact. Error wrapping is consistent with the rest of the DAO layer.
There was a problem hiding this comment.
[No Risk] New error codes for retry flows are well scoped.
EvalItemAlreadyRetrying and ExperimentIsCompleting are clearly named, have stable, non-overlapping codes, and their messages match how they are used in the domain layer (e.g., guarding duplicate retry items and runs in completing state). They are marked as not affecting stability, which aligns with their semantics.
There was a problem hiding this comment.
[No Risk] Domain IDL changes for item_retry_num are consistent with the model.
Introducing item_retry_num on Experiment and ExptFieldMapping fits naturally into the existing evaluation configuration and is wired through the convertors and templates. Field IDs avoid collisions and preserve backward compatibility.
There was a problem hiding this comment.
[No Risk] Extending ExptRunMode with explicit retry modes improves clarity.
Adding EvaluationModeUnknown, EvaluationModeRetryAll, and EvaluationModeRetryItems makes the run-mode enum more expressive and is consumed consistently across the scheduler and application layers. No issues spotted with numeric values or compatibility.
There was a problem hiding this comment.
[No Risk] Deleting this temporary file_name artifact is safe.
This looks like a stray CSV header/sample file that does not belong in the domain service package. Removing it reduces noise in the repo and has no impact on runtime behavior.
What type of PR is this?
feat
Check the PR title
(Optional) Translate the PR title into Chinese
(Optional) More detailed description for this PR(en: English/zh: Chinese)
en:
zh(optional):
(Optional) Which issue(s) this PR fixes